Abstract

Background: MicroRNAs (miRNAs) are ~22 nt long integral elements responsible for post-transcriptional control of
gene expression. After the identification of thousands of miRNAs, the challenge is now to explore their specific
biological functions. To this end, it will be greatly helpful to construct a reasonable organization of these miRNAs 
according to their homologous relationships. Given an established miRNA family system (e.g. the miRBase family 
organization), this paper addresses the problem of automatically and accurately classifying newly found miRNAs to 
their corresponding families by supervised learning techniques. Concretely, we propose an effective method, 
miRFam, which uses only primary information of pre-miRNAs or mature miRNAs and a multiclass SVM, to 
automatically classify miRNA genes. 

Results: An existing miRNA family system prepared by miRBase was downloaded online. We first employed n-grams
to extract features from known precursor sequences, and then trained a multiclass SVM classifier to classify
new miRNAs (i.e. their families are unknown). Compared with miRBase's sequence alignment and manual
modification, our study shows that the application of machine learning techniques to miRNA family classification is
a general and more effective approach. When the testing dataset contains more than 300 families (each of which
holds no fewer than 5 members), the classification accuracy is around 98%. Even with the entire miRBase 15 (1056
families, more than 650 of which hold fewer than 5 samples), the accuracy surprisingly reaches 90%.

Conclusions: Based on experimental results, we argue that miRFam is suitable for application as an automated 
method of family classification, and it is an important supplementary tool to the existing alignment-based small 
non-coding RNA (sncRNA) classification methods, since it only requires primary sequence information. 

Availability: The source code of miRFam, written in C++, is freely and publicly available at:
http://admis.fudan.edu.cn/projects/miRFam.htm.



Background 

Sequences of DNA, RNA and proteins are the funda- 
mental currency of modern biological research, which 
link the different levels of the biological hierarchy, from 
genes to 3D structures [1]. Common features of species 
and functionally important residues can be identified 
through sequence mining. RNA, which stores informa- 
tion like DNA and acts as an enzyme like proteins, may 
have supported cellular or pre-cellular life [2], and is 
crucial to protein synthesis that plays a very important
role in life. 

There are many RNAs with other roles, in particular the
regulation of gene expression. Research shows that non-coding
RNA genes produce a functional RNA product
rather than a translated protein [3]. The most startling 
recent development in the non-coding RNA (ncRNA) 
field is the widespread importance of microRNA 
(miRNA). In the past six years, along with the
development of experimental [4,5] and computational
[6-9] miRNA detection methods, the number of
miRNA genes registered in miRBase [10] increased 
rapidly. We explored miRBase from version 5 to version
15 and found that the number of known miRNAs
increased rapidly during the last several years (Figure 1).

Figure 1 The explosion of miRNA genes (Sep. 2004 - Apr. 2010). MiRNAs registered in miRBase increased rapidly in recent years. Almost at
the same time as we finalized this manuscript, the 16th version of miRBase was released on 10 September 2010. That latest release
is not shown here. Similar information is also presented in [10].

A similar trend can also be seen in [10]. It can be 
expected that with the use of next-generation sequen- 
cing technology [11-13], more miRNA genes will be 
identified. MiRNAs [14], belonging to the family of 
small non-coding RNAs (sncRNAs), are endogenous in 
many animal and plant genomes, and are now recog- 
nized as one of the major regulatory gene families in 
eukaryotic cells [15]. They modulate diverse biological 
processes, including embryonic development, tissue dif- 
ferentiation, and tumorigenesis. MiRNAs inhibit transla- 
tion and promote mRNA degradation via sequence- 
specific binding to the 3'UTR regions of mRNAs [16]. 
Mature miRNAs are derived from longer precursors, 
each of which can fold into a hairpin structure that con- 
tains one or two mature miRNAs in either or both its 
arms [17]. The biogenesis of a miRNA in animals con- 
sists of two steps. In the first step, the primary miRNA 
(pri-miRNA), which is several hundred nucleotides long, 
is processed in the nucleus by a multi-protein complex 
containing an enzyme called Drosha to give rise to the 
~70 nt long miRNA stem-loop precursor (pre-miRNA), 
which is then exported to the cytoplasm. The second 
step takes place in the cytoplasm where the pre-miRNA 
matures into a ~22 nt long miRNA:miRNA* duplex,
with each strand originating from the opposite arm of
the stem-loop [18]. Then, the miRNA strand of the
miRNA:miRNA* duplex is loaded into a ribonucleoprotein
complex known as the miRNA-induced silencing
complex (miRISC) [19]. Until recently, the miRNA* strand was
thought to be peeled away and degraded. However,
some studies indicate that miRNA* is also sorted into
Argonaute proteins and might have a regular function in
Drosophila melanogaster [20].

MiRBase is the central online repository of miRNA 
nomenclature, sequence data, annotation and target pre- 
diction, which first appeared in Oct. 2002 [21]. Release 
15 contains 14197 miRNA loci from 66 species. From 
version 5.0, miRBase began to classify miRNAs into dif- 
ferent families. 

This information is stored in miFam.dat,
which is freely available online at http://www.mirbase.org.
These families were prepared manually. Essentially,
it was done by using the single-linkage method to clus- 
ter the precursor sequences based on BLAST hits, and 
then adjusting (merging and/or splitting) manually the 
clustered families by multiple sequence alignment. The 
aim is to put miRNAs that have a common ancestor 
into the same family. 

Rfam [22] is another well known RNA database. It con- 
tains a collection of multiple sequence alignments and 
covariance models (CMs) that represent ncRNA families. 
The primary aim of Rfam is to annotate new members of 
known RNA families on nucleotide sequences, particu- 
larly complete genomes, by using sensitive BLAST filters 
in combination with CMs. Both primary sequences and 
base-paired secondary structures are used to establish 
and annotate families. Release 10 contains 1446 families, 
including 453 miRNA families. But the quality of multi- 
ple sequence alignments and secondary structures is still 
a challenge for Rfam. Furthermore, Rfam requires a lot of
computing resources to establish the family structure, 
which is time consuming, especially when the number of 
sequences is huge. 

Since pre-miRNAs can form stable hairpins, this speci- 
fic structural property has been used to cluster or classify 
them by some ncRNA clustering or classification meth- 
ods [23,24]. Will et al. [23] presented a structure-based 
clustering approach, LocARNA (local alignment of 
RNA), which is capable of extracting putative RNA 
classes from genome-wide survey of structured RNAs. 
The performance of LocARNA relies on the prediction 
accuracy of RNA secondary structures. However, current 
RNA secondary structure energy models are not always 
able to predict native RNA structures, even for short 
molecules [25]. Furthermore, hairpin secondary structure 
might be less effective in miRNA classification since all 
miRNAs can fold back into this type of structure. 

So far, multiple sequence and/or structure alignments
are still widely used in the ncRNA clustering and classification
field. But neither of them has completely solved the
ncRNA clustering or classification problem, especially
for miRNAs. Effectiveness aside, even their efficiency
is still far from satisfactory, since these
methods can be very time-consuming when applied to
large-scale validation of miRNA families.

MiRNAs are highly conserved not
only in their primary sequences but also in their secondary
structures, and miRNAs in the same family always have
consensus secondary structures and similar functions
[26]. Hence, it is biologically meaningful to assign
miRNAs with consensus secondary structures and similar
functions to the same family. In this paper, based on the
family system provided by miRBase, we explored supervised
learning techniques to accurately and automatically
classify pre-miRNA or mature sequences.

Concretely, we propose an effective alignment-free
model named miRFam to classify newly detected miRNAs.
First, it extracts n-grams as features from primary
sequences. Then, these n-gram features are integrated
into one feature vector by concentration. Finally, it trains
a multiclass SVM classifier (SVM^multiclass) based on the
families prepared by miRBase to classify new pre- 
miRNA or mature sequences whose families are not yet 
known. 
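
The feature extraction step described above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the function name `ngram_features`, the use of relative frequencies, and the skipping of windows that contain symbols other than A, C, G and U are assumptions made for illustration, and the concentration-factor weighting is not reproduced.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA bases; the ordering of features follows this string


def ngram_features(seq, orders=(1, 2, 3)):
    """Return one concatenated vector of relative n-gram frequencies.

    Hypothetical helper: one block of 4**n values per order n,
    concatenated in the order given by `orders`.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    vector = []
    for n in orders:
        grams = ["".join(p) for p in product(ALPHABET, repeat=n)]
        counts = dict.fromkeys(grams, 0)
        for i in range(max(len(seq) - n + 1, 0)):
            gram = seq[i:i + n]
            if gram in counts:  # windows with non-ACGU symbols are skipped
                counts[gram] += 1
        total = sum(counts.values())
        # Relative frequency within each order (an assumption of this sketch).
        vector.extend(counts[g] / total if total else 0.0 for g in grams)
    return vector


# Arbitrary 22 nt example sequence.
features = ngram_features("UGAGGUAGUAGGUUGUAUAGUU")
print(len(features))  # 4 unigrams + 16 bigrams + 64 trigrams
```

Because every sequence maps to a vector of the same length regardless of its own length, such vectors can be passed directly to a multiclass classifier.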

As a powerful tool, miRFam aims to classify new miR- 
NAs into their corresponding families. It can not only 
support researchers who just obtained novel miRNAs 
computationally or experimentally to go on exploring 
the function of these miRNAs, but also enhance the uti- 
lity of miRBase by providing higher automation and 
accuracy for miRNA classification. When measuring
sequence similarity, unlike BLAST [27] or other
BLAST-based methods, miRFam uses shorter sequence
segments; thus it has a much smaller search space,
which allows it to run faster. As the first miRNA-oriented
sncRNA family classification method, miRFam
has several advantages: (1) Only primary information of 
miRNAs is required, no other assumption (e.g., common 
secondary structures within a family or limitation of 
sequence length) is imposed. (2) Compared with multi- 
ple sequence alignment (MSA), miRFam is more effi- 
cient and accurate. To classify ~ 10,000 pre-miRNA 
sequences, MSA will cost several hours while miRFam 
consumes only several minutes. (3) miRFam is insensi- 
tive to sequencing error and the exact position of pre- 
miRNA in pri-miRNA. The change of several bases has 
very little effect on the feature vectors. 

Results 

In order to evaluate the miRFam method, we designed a 
pipeline that is illustrated in Figure 2. The experiments 
were arranged into three groups: single family tests, 
multi-family tests and application-oriented large-scale 
miRBase family tests, which were conducted on a number
of datasets whose details are presented in the Methods
section. The tests were run in that order. Single family tests are classical
binary classification, while the other tests are multiclass
classification. With miRFam, users can conveniently
choose different combinations of n-grams. In
our experience, unigrams, bigrams, trigrams and
tetragrams are enough to classify all miRNAs registered
in miRBase. For single family and multi-family tests,
unigrams, bigrams and trigrams alone are enough
to achieve satisfactory classification performance. All
experimental results were achieved by 5-fold cross vali- 
dation. That is, each dataset is first randomly divided 
into five equally-sized partitions, each of which contains 
the same ratio of positive and negative examples. And 
then any four partitions are merged as the training set 
to train miRFam, which is further evaluated with the 
fifth data partition. This procedure is repeated five times 
with different combinations of training and testing sets, 
and the final classification performance is obtained by 
averaging the five tests' results. 
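
The 5-fold procedure described above can be sketched as follows. This is only an illustration under stated assumptions: `stratified_folds`, `cross_validate` and the `train_and_score` callback are hypothetical names, and the round-robin assignment is one simple way to keep class ratios similar across partitions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin keeps class ratios similar
    return folds


def cross_validate(samples, labels, train_and_score, k=5, seed=0):
    """Average the score returned by `train_and_score` over k splits.

    `train_and_score(train, test)` is a user-supplied callback (hypothetical)
    that trains on `train` and returns a score measured on `test`.
    """
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        test_idx = set(held_out)
        train = [(samples[i], labels[i])
                 for i in range(len(samples)) if i not in test_idx]
        test = [(samples[i], labels[i]) for i in held_out]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```

Each sample is used for testing exactly once, and the reported figure is the mean of the five per-fold scores.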

Single family tests 
Synthetic dataset test 

The three biggest families in miRBase 14 are let-7, mir- 
17 and mir-9, which contain 208, 154 and 134 members, 
respectively. These three families were merged with 
three synthetic datasets R1, R2 and R3, respectively.
miRFam was then tested on these three merged datasets,
which are denoted as "let-7+R1", "mir-17+R2" and
"mir-9+R3". Our aim is to show that miRFam can distinguish
real pre-miRNAs from synthetic random
sequences with similar base compositions.

Figure 2 The experimental pipeline. In order to show the discriminative power of miRFam, we designed a series of experiments including
single family tests, multi-family tests, and tests on large-scale miRBase families. All these experiments were carried out by 5-fold cross validation.
Details of datasets and features are also shown.

As expected,
the combination of n-grams and the multiclass SVM algorithm
can precisely separate real miRNAs from random
sequences. Experimental results are presented in Table
1, from which we can see that the accuracy is higher
than 98.5% for all three families. Next, we took
"let-7+R1", whose accuracy lies in the middle, as an example for
further analysis. In 5-fold cross validation, only four
sequences (MI0010673, MI0010668, RANDOM195,
RANDOM198) were misclassified. MI0010673 and
MI0010668 were first discovered in Schistosoma japonicum
by cloning and sequencing a small (18-26 nt)
RNA cDNA library from adult worms [28]. We submitted
these two real miRNA sequences to Rfam (version
10.0) separately, but no hit was obtained. We then
turned to ClustalW2 to generate the MSA with default
parameters and viewed the guide tree with Jalview 2.5 (see
Figure S1 in additional file 1). We found that
MI0010673 and MI0010668 were located in separate

Table 1 Results of single family experiments

Dataset   Experiment   SE(%)   SP(%)   Acc(%)
R*        let-7+R1     99.50   99.52   99.51
          mir-17+R2    100.0   100.0   100.0
          mir-9+R3     98.58   98.46   98.52
S         let-7+S      99.02   99.69   99.42
          mir-17+S     99.33   99.69   99.57
          mir-9+S      100.0   99.38   99.56

* Only trigram and bigram features are considered in these experiments.



branches, while RANDOM195 and RANDOM198 lay
in nearby branches. These results show that these synthetic
sequences were so similar to the real ones that
they were indistinguishable by either miRFam or MSA.
In order to give a more intuitive picture of these data- 
sets, we calculated the Euclidean distance (ED) between 
the real and synthetic cluster centers, and we found that 
the larger the Euclidean distance is, the better the classi- 
fication performance is (see Figure S2 in additional file 
1). 
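
The distance between two cluster centers mentioned above can be computed as in the following sketch. The helper names `center` and `euclidean` and the two-dimensional toy vectors are hypothetical; real feature vectors would have one component per n-gram.

```python
import math


def center(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy feature vectors, invented for illustration only.
real = [[0.2, 0.8], [0.4, 0.6]]
synthetic = [[0.6, 0.4], [0.8, 0.2]]
distance = euclidean(center(real), center(synthetic))
```

A larger `distance` means the two classes are further apart in feature space, which matches the observation that larger distances coincide with better classification performance.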

Real dataset test 

MiRNAs and snoRNAs are two classes of small non- 
coding regulatory RNAs, which have been extensively 
investigated in recent years. Although their functions in 
the cell are distinct, they share interesting genomic simi- 
larities. Recent sequencing projects have identified pro- 
cessed forms of snoRNAs that resemble miRNAs. A 
comparison of the genomic locations of reported miR- 
NAs and snoRNAs reveals an overlap of some specific 
members of these two classes [29,30]. Keeping this in 
mind, we evaluated miRFam on another three datasets, 
which were constructed by merging dataset S with the 
families let-7, mir-17 and mir-9, and were denoted as 
"let-7+S", "mir-17+S" and "mir-9+S", respectively. The 
results are presented in Table 1, which shows that miRFam
can easily distinguish miRNAs from snoRNAs, and
the accuracies are higher than 99%.

The effect of concentration factor

In this paper, we introduced the concentration factor to
weight the features of family vectors (see Equation 2).
Table 2 Results of different combinations of n-gram types

Group     Acc(trigram)   Acc(tri- & bigram)   Acc(tri-, bi- & unigram) a   Acc(tri-, bi- & unigram) b
T20       90.67          96.21                68.90                        96.76
G1        93.61          98.40                87.63                        98.86
G2        87.62          99.01                87.74                        99.01
Total c   85.08          93.48                63.75                        93.62

a Results of miRFam with unigram, bigram and trigram, without concentration factor.
b Results of miRFam with unigram, bigram and trigram, with concentration factor.
c Combination of T20, G1 and G2.
All results are percentages.



Intuitively, the longer fragments of sequences should be 
more informative than the shorter ones. For example, 
with some exceptions [31], a triplet codon in a nucleic 
acid sequence specifies a single amino acid. And here, a 
trigram is exactly a triplet. Thus, in representing miRNA
sequences, the longer n-grams should outweigh the
shorter ones. In what follows, we will see whether our
concentration factor weighting scheme conforms to the 
above intuition and observation, by checking the centers 
(before and after weighting) of the three families (let-7, 
mir-17 and mir-9) and dataset S (the mixed snoRNA 
class). 

Figure 3(A) and Figure 3(B) show the center vectors of the
four families before and after weighting by the concentration
factor C, respectively. Roughly, before
weighting, trigrams have clearly smaller values than
bigrams and unigrams, but after weighting, trigrams
are substantially enhanced. Furthermore, we calculated
the variance of each feature's value among the four families
before and after weighting; the results are illustrated in
Figure 3(C) and Figure 3(D). We can see that after
weighting, the variances of trigrams are relatively
enlarged, while the variances of bigrams and unigrams
are substantially restrained. That is to say, our weighting
scheme makes the trigram feature values of different
families more distinct, which benefits the classification
of these families. Additionally, we evaluated
the effect of concentration factor on multi-family data- 
sets (Table 2). Without the concentration factor, more 
than 10% classification accuracy was lost on all datasets. 
MiRFam performed even worse when only trigrams 
were used. 

In summary, the analysis on the feature vectors of dif- 
ferent families shows that the concentration factor 
weighting scheme can enhance the trigrams while 
restraining the bigrams and unigrams, which is reasonable
and consistent with the intuition and observation.
Most importantly, our extensive classification experi- 
ments in this and the later sections also show indirectly 
that the weighting scheme is effective. 

Multi-family tests 

As mentioned before, with the development of powerful 
deep sequencing technology, more miRNA genes will be 
identified in the future. But the number of real miRNAs 
in a certain genome is still unknown. Thus, a major 
concern is how well miRFam will perform if only a 
small number of known miRNAs are available for some 
certain families and species. In the previous single family 
tests, we have employed three types of n-grams (unigrams,
bigrams and trigrams) as features, so one natural
question is how the different combinations of these
types of n-grams will impact miRFam's performance.

Figure 3 Center vectors comparison among three miRNA families (let-7, mir-17 and mir-9) and the dataset S (snoRNAs). All horizontal
axes are the n-grams arranged in the order from trigrams to bigrams and unigrams. (A) Family centers before weighting; (B) Family centers after
weighting; (C) The variances of n-gram feature values among the four families before weighting; (D) The variances of n-gram feature values
among the four families after weighting.

Furthermore, as mature miRNAs and hairpin sequences
differ somewhat, we also ask whether
miRFam will perform differently on them. To answer
these questions, we tested miRFam on three multi- 
family datasets constructed from miRBase (version 14) 
according to their family members. T20 contains the
20 biggest families in miRBase (version 14), while G1
and G2 contain those families whose members number
around 40 and 20, respectively. Here, the numbers 40
and 20 were chosen arbitrarily. Performance measurements
like sensitivity and specificity are usually defined
for binary classification. Here we actually deal with 
multi-class (i.e. multi-family) classification, so we use 
accuracy (Acc) as the performance indicator. 
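
For reference, the standard definitions of sensitivity (SE), specificity (SP) and accuracy (Acc) can be written as a short sketch. The function names are hypothetical, and the zero-denominator fallbacks are a convenience of this sketch rather than part of the definitions.

```python
def binary_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and accuracy with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    se = tp / (tp + fn) if tp + fn else 0.0  # share of positives recovered
    sp = tn / (tn + fp) if tn + fp else 0.0  # share of negatives recovered
    acc = (tp + tn) / len(pairs)
    return se, sp, acc


def multiclass_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted family equals the true family."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

With many families there is no single positive class, so the fraction of correctly assigned sequences is the natural summary.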
The impact of training dataset size 

All 2198 precursor sequences in T20 were divided into
ten equally-sized partitions. First, we randomly took one
partition (10%) of the sequences as the training set and the
remaining nine partitions (90%) as the testing set. Then,
we increased the training set by one partition (10%), and
accordingly the testing set was reduced by one partition
(10%). This process continued iteratively until half of T20
was used for training and the other half for testing. At each
round, miRFam was trained and tested, and its performance
was evaluated by cross validation. As shown in Figure
4, the accuracy is 56.01% when only 10% of T20 is
used for training. With the increase of training samples,
the accuracy steadily goes up. When the training set and
the testing set are of equal size, the accuracy of miRFam
is nearly 90%. For a normal 5-fold cross validation on
the whole dataset, i.e., training miRFam with 80% of the samples
and testing it with the remaining 20%, the accuracy
is 96.76%.

Figure 4 Classification performance vs. the size of training
dataset. We used T20 to show the impact of training dataset size.
At the beginning, only 10% of the 2198 sequences in T20 were treated
as training samples while the others (90%) were used to test miRFam.
At each round, we increased the training set by one partition (10%),
and accordingly the testing set was reduced by one partition (10%).
This process continued iteratively until half of T20 was used for training and
the other half for testing. The result of normal 5-fold cross validation
is also shown. Horizontal axis: percentage of training sequences (%).

The impact of the combination of different n-gram types 

Here, we examine how classification performance will be 
impacted by the different combinations of unigrams, 
bigrams and trigrams on these multi-family datasets 
(Table 2). We also tested miRFam with tetragrams;
the results are presented in Additional file 1,
Table S3.

We found that miRFam performs better when more
types of n-gram features are used. Even when only
trigrams were used to classify miRNAs, the accuracy was
around 90%. For the G2 dataset, when features of unigram,
bigram and trigram types were all included, the
accuracy was, surprisingly, more than 99%. Exploring
the classification results further, we also found that
some abnormal sequences with noise bases (not A, U, G
or C) were classified correctly in 5-fold cross validation
(sequences are listed in Table S2 in Additional
file 1), which suggests that miRFam is insensitive to base
changes, such as single-nucleotide polymorphisms (SNPs)
or sequencing errors.

In addition, by transforming pre-miRNA sequences to 
feature vectors, both normal and abnormal sequences 
were handled in a similar process, thus avoiding the 
cumbersome addition, deletion and modification opera- 
tions used in MSA. 
Test with mature miRNAs 

It has been shown that miRNAs are modified after
maturation [32]. So, we also evaluated miRFam on the
mature miRNAs contained in these multi-family datasets
(Table 3). Compared with the results in Table 2, it can be
seen that in most cases, miRFam performs better with
mature miRNAs than with precursors, which indicates
that miRFam can accurately classify both hairpin and
mature sequences. In fact, for a mature miRNA, the
seed region is always much more functional than the
other regions; it is the core functional region of its precursor.
Thus, miRBase also prefers to put miRNAs with
similar mature sequences into the same families. That is
the reason why miRFam can achieve better performance


Table 3 Results on mature miRNAs

Group   Families   Members*   Acc(tri-, bi- & unigram, %)
T20     20         1529       96.80
G1      10         351        97.71
G2      10         162        99.38
Total   40         2042       95.03

* There are two reasons why the numbers of mature sequences in the multi-family datasets
are smaller than the numbers of hairpins. First, different pre-miRNAs may generate similar
mature miRNAs. Second, some pre-miRNAs contain several mature miRNAs,
but only one is considered.
with the shorter mature sequences. It is also more efficient to classify
mature miRNAs than to classify pre-miRNAs, since
mature sequences usually contain no more than 30% of the bases of
their precursors.

Application-oriented large-scale families tests 

A good model should not be data specific, instead it 
should be generally applicable. Although miRFam can 
achieve excellent results in single family tests and multi- 
family tests, what we really care about is its practical 
application performance. Based on this consideration, we 
evaluated miRFam on large-scale families from miRBase 
(version 14 and 15). Results are presented in Table 4. 

Since 5-fold cross validation was employed, families
that contain fewer than 5 members were not considered
at first. A detailed family distribution in miRBase can
be found in Figure S3 in additional file 1. From
miRBase v14, the 334 families that contain no fewer than
5 members were selected, which hold 87.49% (7797/
8912) of the pre-miRNA sequences in the whole database. On
this dataset, miRFam achieved an accuracy of 98.18%.

While we were preparing this manuscript, miRBase
version 15 was released in April 2010. This is a significant
update, with over 3000 new hairpin sequences and
more than 4000 new mature sequences. From miRBase
v15, 398 families were selected, each of which contains
no fewer than 5 members. These families constitute
84.38% (9379/11115) of the hairpin sequences in the whole
database. Even with such large-scale families, miRFam
still achieved an accuracy of 97.97%.

When dealing with miRBase v15, there are still 1736
pre-miRNAs distributed in 658 families that were not
processed (see Figure S3). Among them, 351 families
have only 2 members. In the final experiment, we tested
miRFam on all 1056 families in miRBase v15.
For those families with fewer than 5 members, we randomly
chose one member as the testing sample and used the
remaining members as training samples. miRFam still obtained an
accuracy of 90.66%, which was a surprisingly satisfactory 
result, considering that classifying a dataset with a large 
number of classes and the extremely uneven distribu- 
tions of members in these classes is a well-recognized 
challenging task. 
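
The handling of small families in this final experiment can be sketched as follows. The function name `hold_one_out_split` and the input layout (a dictionary from family name to member identifiers) are hypothetical. The text does not say how families with a single member are treated; keeping that member in the training set is an assumption of this sketch.

```python
import random


def hold_one_out_split(families, min_size=5, seed=0):
    """Split small families into train/test; set large families aside.

    families: dict mapping family name -> list of sequence identifiers.
    Returns (train, test, large), where train and test are lists of
    (identifier, family) pairs and large holds families of at least
    min_size members, which go through ordinary 5-fold cross validation.
    """
    rng = random.Random(seed)
    train, test, large = [], [], {}
    for name, members in families.items():
        if len(members) >= min_size:
            large[name] = members
        elif len(members) == 1:
            # Assumption: a lone member can only serve as a training sample.
            train.append((members[0], name))
        else:
            held = rng.randrange(len(members))  # one random test member
            test.append((members[held], name))
            train.extend((m, name) for i, m in enumerate(members) if i != held)
    return train, test, large
```

Under this scheme every small family with at least two members contributes exactly one test sequence.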



Table 4 Performance of large-scale miRBase families test

                   miRBase14   miRBase15   miRBase15
Family number      334 a       398 a       1056 b
MiRNA number       7797        9379        11115
Accuracy (%) c     89.21       88.91       85.09
Accuracy (%) d     98.18       97.97       90.66

a Families in miRBase whose members are no less than 5.
b All families in miRBase 15 are used.
c miRFam results with uni-, bi- and trigram features.
d miRFam results with uni-, bi-, tri- and tetragram features.



Discussion 

Effectively classifying newly detected miRNAs to their
corresponding families is helpful for their further functional
analysis. However, only a few studies have
addressed this issue, and it is far from being settled.
Unlike existing alignment-based sncRNA clustering
or classification methods [23,33,34], which can also
be used to cluster or classify miRNAs, the proposed miRFam
is based on supervised learning techniques, which are
more general and effective. It does not require sequence-
or structure-based alignment; thus it is free from the difficulty
of choosing the multiple parameters used in
alignment-based methods, and is also free from the quality
issue of miRNA secondary structure prediction. Certainly,
miRFam is not completely parameter-free: it still
has to set two parameters, i.e., the feature vector length l
and the trade-off c between training error and margin.
Another advantage of the miRFam method is its efficiency,
especially when the number of sequences is huge.
Furthermore, miRFam achieves satisfactory classification
performance over the family system prepared by
miRBase: over all predictions made by miRFam, the accuracy
is above 90%. Therefore, it can be used to replace
the manual modification, which will greatly save time.

Most known miRNA sequences are evolutionarily conserved
[35], miRNA families may have consensus secondary
structures [26], and the microRNA-target
relationships are also conserved [36]. As people's
interest in the miRNA world continuously grows, more 
and more datasets are going to appear. Correspondingly, 
there is an urgent need to classify the newly discovered 
miRNAs into their corresponding families according to 
sequence and/or structure similarities. With correct 
family classification, it is easier to elucidate the struc- 
tures and functions of the new sequences, by using mul- 
tiple sequence alignments. Apparently, more in-depth 
information can also be available, such as SNPs within 
pre-miRNAs and mature miRNAs [37]. 

One potential limitation of the proposed approach is 
that it relies on a prepared family classification struc- 
ture. Actually, this is a common problem with classifica- 
tion - a supervised machine learning approach, and the 
quality of training sets significantly influences classifica- 
tion accuracy. To overcome this limitation, we can turn 
to clustering analysis, which is an unsupervised learning 
approach that can automatically group the miRNA 
sequences into different categories based on their char- 
acteristics of sequences and/or structures. We keep this 
issue as our future work. 

Conclusions 

Sequence alignments are useful for the analysis of genomic
data. For example, miRNA genes in newly sequenced
organisms can be detected based on their homology to
genes in related and well-studied species [4,38]. Once
homologous genes are detected, one can perform an MSA
with the hope of establishing miRNA families. However,
MSA is time-consuming for this task, different
MSA algorithms may build quite different alignments,
and choosing an appropriate alignment algorithm is crucial
to the performance of family classification.

In this article, we developed a new approach, miRFam,
to accurately and automatically classify miRNA precursors
by using n-grams and a multiclass SVM classifier.
To evaluate the miRFam method, we designed a
pipeline, including single family tests, multi-family tests 
and large-scale families tests. Based on the experimental 
results, the following conclusions could be drawn: 

1. miRFam can effectively distinguish synthetic ran- 
dom sequences and similar snoRNA sequences from 
real pre-miRNA sequences (Table 1). 

2. Even with a small number of training samples, 
miRFam can still achieve a high accuracy. And with
more types of n-gram features, miRFam can perform
better (Table 2 & Figure 4).

3. Both precursors and mature miRNAs can be used 
to infer miRNA families. With shorter mature 
sequences, miRFam can achieve better classification 
result (Table 3). 

4. When the dataset contains more than 300 families 
and each family holds no less than 5 members, the 
classification accuracy is around 98%. Even with the 
entire miRBase (version 15, 1056 families, more
than 650 of which hold fewer than 5 samples), the
accuracy surprisingly reaches 90% (Table 4).

In summary, we proposed miRFam, the first supervised
learning based approach to automatically assign miRNA
precursors to their corresponding families with high
accuracy. It can assist family classification,
especially in applications that have previously been
done manually, such as miRBase. Additionally, owing to its
robustness, miRFam can be used in a wide range of
scenarios, as long as existing family assignment
information is available. Its performance certainly
depends on that existing information. However, as
research on miRNAs grows, it is foreseeable that more
miRNAs will be identified and registered in miRBase,
which will favor the use of the miRFam method. In
return, miRFam will contribute to the efficient
exploration of these newly discovered miRNAs.

Methods 

Datasets 

In this work, we constructed several datasets using data
from miRBase and Rfam. These datasets were divided
into three categories: single family datasets, multi-family
datasets and large-scale family datasets. To facilitate the
description, we used some notations to represent the
datasets of the first two categories. These notations are
summarized in Table 5.

We first ranked miRNA families in miRBase according
to the number of members contained in each family. R
contains three subsets R1, R2 and R3, corresponding to
the three biggest families in miRBase v14 (let-7, mir-17
and mir-9). R1, R2 and R3 were constructed by reversing
the original pre-miRNA sequences in let-7, mir-17
and mir-9, respectively, with squid [39]. S was
constructed by mixing SNORA26 and SNORA33 downloaded
from Rfam v10.0.

SNORA26 (RF00568) is a member of the H/ACA class
of small nucleolar RNAs, while SNORA33 (RF00133) is
a member of the C/D box class. After being filtered to
less than 90% identity, they contain 195 and 122
sequences, respectively. Three multi-family datasets
(T20, G1, G2) were constructed from miRBase v14 based
on the result of family ranking. The biggest family in G1
is mir-33, containing 47 members, and the smallest
family is mir-26, containing 41 members. The biggest
(smallest) family in G2 is mir-315 (mir-320),
containing 21 (20) miRNAs (Additional file 1, Table S1).
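
The family ranking and the Average Number of Members (ANM) reported in Table 5 can be illustrated with a short Python sketch. This is not the authors' code: the function names, the tie-breaking rule and the toy assignments are assumptions made only for illustration, and no real miRBase data is used.

```python
from collections import Counter


def rank_families(assignments):
    """Return (family, size) pairs sorted from largest to smallest family.

    `assignments` maps a miRNA identifier to its family name
    (a hypothetical input format).
    """
    sizes = Counter(assignments.values())
    # Sort by descending size; ties are broken alphabetically (an assumption).
    return sorted(sizes.items(), key=lambda item: (-item[1], item[0]))


def average_number_of_members(ranked, top_k):
    """Average family size over the top_k largest families."""
    top = ranked[:top_k]
    return sum(size for _, size in top) / len(top)


# Toy, made-up assignments used only to exercise the functions.
toy = {
    "seq1": "let-7", "seq2": "let-7", "seq3": "let-7",
    "seq4": "mir-17", "seq5": "mir-17",
    "seq6": "mir-9",
}
ranked = rank_families(toy)
anm_top2 = average_number_of_members(ranked, 2)
```

With a real family file, the same ranking could be sliced to obtain a dataset such as T20 (the 20 largest families).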

Feature vectors 

In this paper, we treat family establishment as a
classification problem. The first step is to transform miRNA
sequences to numeric vectors, which are usually called
feature vectors. Here, n-grams [40] are used as features
of miRNA sequences.
n-grams 

An n-gram is a subsequence consisting of n spatially
consecutive items from a given sequence. The items in
this study are pre-miRNA bases (A, C, G and U). An
n-gram of size 1 (i.e. n = 1) is referred to as a "unigram",
size 2 (n = 2) is a "bigram", size 3 (n = 3) is a "trigram",
size 4 is a "tetragram", and size 5 or more (i.e. n ≥ 5) is
simply called an "n-gram". In the sequel, we also refer to
unigrams, bigrams, trigrams and tetragrams as type 1, 2, 3
and 4 n-grams, and so on. n-grams can be used for
efficient approximate matching. By converting a miRNA
precursor to a set of n-grams, it can be embedded into
a vector space, thus allowing a sequence to be compared
with others in an efficient manner. Here, we select
unigrams, bigrams, trigrams and tetragrams as features.

Table 5 Notations of datasets

Category        Notation  Description
Single family   R (a)     reverse sequences of the three biggest miRNA families
Single family   S         combination of SNORA26 and SNORA33 from Rfam 10.0
Multi families  T20       20 families with the most members, ANM (b) = 109.9
Multi families  G1        10 families selected from miRBase14, ANM (b) = 43.8
Multi families  G2        10 families selected from miRBase14, ANM (b) = 20.2

(a) R1 - let-7; R2 - mir-17; R3 - mir-9.
(b) ANM - Average Number of Members.

To extract n-grams, we use a window of size n that
slides along pre-miRNA sequences from 5' to 3'. At each
position on a sequence, the subsequence of length n
covered by the sliding window corresponds to an n-gram.
As the window slides forward, the occurrence frequency
t of each encountered n-gram is recorded.
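
The sliding-window extraction can be sketched in a few lines of Python. This is a minimal illustration rather than the miRFam implementation (which is written in C++): the function name is hypothetical, the window is assumed to advance one base at a time so that consecutive n-grams overlap, and the example sequence is a toy fragment.

```python
from collections import Counter


def count_ngrams(sequence, n):
    """Count every n-gram seen by a width-n window sliding 5' to 3'.

    Assumes a step of one base, so consecutive windows overlap by n - 1.
    """
    counts = Counter()
    for start in range(len(sequence) - n + 1):
        counts[sequence[start:start + n]] += 1
    return counts


example = "UGAGGUAG"  # toy fragment, chosen only for illustration
bigrams = count_ngrams(example, 2)
```

A sequence of length L yields L - n + 1 windows, so the eight-base example contains seven bigrams in total.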
Concentration 

Since RNA sequences contain only the four bases A, U,
G and C, we have 4 unique unigrams, $4^2$ unique
bigrams, $4^3$ unique trigrams and $4^4$ unique tetragrams.
In order to combine these different features into one
feature vector, we introduce a concentration factor.
Denote the number of unique n-grams of type i as $N_i$.
The concentration of type i is the ratio of $N_i$ over the
total number of unique n-grams. That is,

$$ C_i = \frac{N_i}{\sum_{k=1}^{4} N_k}, \quad i = 1, 2, 3, 4 \qquad (1) $$
For example, the trigram (type 3) has $4^3 = 64$ unique
trigrams. The total number of unique n-grams used in this
study is 340 (4+16+64+256), therefore the trigram's
concentration is $C_{tri}$ = 64/340 = 0.188. Then, the
elements of a feature vector are calculated by (2).

$$ f_j = \frac{t_i}{T_i} \times C_i, \quad j \in \mathbb{Z} \ \text{and} \ 1 \le j \le 340 \qquad (2) $$

Above, $t_i$ is the occurrence frequency of a certain
unique n-gram of type i, and $T_i$ is the total occurrence
frequency of all unique n-grams of type i. A feature
vector contains 340 dimensions, each of which corresponds
to a unique n-gram of a certain type i (i = 1, 2, 3 and 4).
Within a vector, the dimensions are arranged in the
order of tetragrams, trigrams, bigrams and unigrams.
The sum of all dimensional values of a feature vector is 1.
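
To make Equations (1) and (2) concrete, the following Python sketch builds a 340-dimensional vector for one sequence. It is an illustration under stated assumptions, not the authors' C++ code: the function name is hypothetical, n-grams within each type are ordered alphabetically over "ACGU" (the paper only fixes the order of the types), windows containing other symbols are skipped, and the sequence is assumed to have at least four bases.

```python
from itertools import product

BASES = "ACGU"
TYPE_ORDER = (4, 3, 2, 1)  # tetragrams, trigrams, bigrams, unigrams


def feature_vector(sequence):
    """Return the concentration-weighted n-gram frequencies of a sequence."""
    # Total number of unique n-grams over the four types: 4 + 16 + 64 + 256.
    total_unique = sum(len(BASES) ** n for n in TYPE_ORDER)
    vector = []
    for n in TYPE_ORDER:
        # Within-type ordering is an assumption (alphabetical over BASES).
        vocabulary = ["".join(p) for p in product(BASES, repeat=n)]
        concentration = len(vocabulary) / total_unique  # C_i in Equation (1)
        counts = dict.fromkeys(vocabulary, 0)
        for start in range(len(sequence) - n + 1):
            gram = sequence[start:start + n]
            if gram in counts:  # skip windows with non-ACGU symbols
                counts[gram] += 1
        type_total = sum(counts.values())  # T_i in Equation (2)
        for gram in vocabulary:
            share = counts[gram] / type_total if type_total else 0.0
            vector.append(share * concentration)  # f_j in Equation (2)
    return vector


vec = feature_vector("UGAGGUAGUAGGUUGUAUAGUU")  # toy input
```

Because the shares within each type sum to 1 and the four concentrations sum to 1, the whole vector sums to 1, which matches the property stated in the text.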



Multiclass SVM

Binary classification using a support vector machine
(SVM) is a well-developed technique. However, for
performance reasons, using a single SVM formulation
directly to solve the multiclass problem is usually
avoided. A better approach is to use a combination of
several binary SVM classifiers to solve the multiclass
problem. Typical algorithms of multiclass learning
include the multiclass extensions to decision tree
learning [41] and various specialized versions of the
boosting approach, such as AdaBoost.M2 and AdaBoost.MH
[42,43]. However, the dominant approach to the
multiclass problem is the multiclass SVM. One of the most
widely used multiclass SVM methods is one-versus-all.
In this method, M binary classifiers are constructed,
one per class. The i-th classifier's output function $F_i$
is trained by using the examples from class i as positives
and the examples from all other classes as negatives. For
a new example x, the one-versus-all SVM strategy assigns
it to the class with the largest value of $F_i$ [44].

In this study, we use the popular multiclass SVM
package SVM^multiclass (version 2.20). SVM^multiclass uses
the multi-class formulation described in [45], and is
optimized so that it is very fast in linear cases [46].

MSA implementation and visualization

Multiple sequence alignment is done with Clustal W
(version 2.0) [47]. The tree visualization of MSA results
is produced with Jalview (version 2.5) [1]. These tools
are also used by EMBL-EBI online.

Evaluation

The most straightforward way to evaluate the performance
of a classifier is based on confusion matrix
analysis. With this matrix, it is possible to evaluate a
number of widely used metrics for measuring the
performance of a learning system. Here, we use sensitivity
(SE), specificity (SP) and accuracy (Acc) to evaluate
miRFam. They are defined as follows:

$$ SE = \frac{TP}{TP + FN}, \quad SP = \frac{TN}{TN + FP}, \quad Acc = \frac{TP + TN}{TP + FP + TN + FN} \qquad (3) $$

Here, TP, FP, TN and FN are the numbers of true
positive predictions, false positive predictions, true
negative predictions and false negative predictions,
respectively.
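
The sensitivity, specificity and accuracy measures defined in the Evaluation section can be computed from predicted and true family labels as in the Python sketch below. How the paper aggregates these counts across families is not stated, so the one-family-versus-rest counting here is an assumption, as are the function names and the toy labels.

```python
def one_vs_rest_counts(true_labels, predicted_labels, family):
    """Count TP, FP, TN and FN for one family treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == family and guess == family:
            tp += 1
        elif truth != family and guess == family:
            fp += 1
        elif truth == family:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


# Toy labels, made up for illustration only.
truth = ["let-7", "let-7", "mir-17", "mir-9", "mir-17"]
guess = ["let-7", "mir-17", "mir-17", "mir-9", "mir-17"]
tp, fp, tn, fn = one_vs_rest_counts(truth, guess, "let-7")
se, sp, acc = sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, fp, tn, fn)
```

In this toy case one of the two let-7 sequences is missed and no other sequence is wrongly called let-7, so sensitivity is lower than specificity.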



Additional material 



Additional file 1: Supplement. We collect all supplementary tables
and figures in this file. The detailed family information and abnormal
sequences contained in the three multi-family datasets (T20, G1 and G2)
can be found in Additional file 1, Tables S1 and S2, respectively.
Results of the multi-family test with tetragram features are summarized
in Additional file 1, Table S3. Figures S1 and S2 in Additional file 1
are supplied to support our analysis in Section "Synthetic dataset
analysis", while Figure S3 in Additional file 1 shows the family
distribution in miRBase (versions 14 and 15) according to the number
of family members.